83 research outputs found
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology. Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
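The coarse-then-fine search idea in the abstract above (cover the data with hyperspheres, then search only the spheres that could contain a match) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the greedy covering, the function names `build_clusters` and `range_search`, and the Euclidean metric are all assumptions chosen for brevity.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_clusters(points, cluster_radius, dist=euclidean):
    """Greedy covering (hypothetical helper): each point joins the first
    center within cluster_radius, otherwise it becomes a new center."""
    clusters = []  # list of (center, members)
    for p in points:
        for center, members in clusters:
            if dist(p, center) <= cluster_radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters

def range_search(query, clusters, search_radius, cluster_radius, dist=euclidean):
    """Return all points within search_radius of query."""
    hits = []
    for center, members in clusters:
        # Coarse step: by the triangle inequality, a member within
        # search_radius of the query forces its center to lie within
        # search_radius + cluster_radius of the query.
        if dist(query, center) <= search_radius + cluster_radius:
            # Fine step: exact check inside the surviving cluster only.
            hits.extend(m for m in members if dist(query, m) <= search_radius)
    return hits
```

In this sketch the coarse step costs one distance computation per cluster center, so the work depends on the number of covering spheres rather than directly on the number of points, provided few clusters survive the coarse filter.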
Clustered Hierarchical Entropy-Scaling Search of Astronomical and Biological Data
Both astronomy and biology are experiencing explosive growth of data, resulting in a "big data" problem that stands in the way of a "big data" opportunity for discovery. One common question asked of such data is that of approximate search (ρ-nearest neighbors search). We present a hierarchical search algorithm for such data sets that takes advantage of particular geometric properties apparent in both astronomical and biological data sets, namely the metric entropy and fractal dimensionality of the data. We present CHESS (Clustered Hierarchical Entropy-Scaling Search), a search tool with virtually no loss in specificity or sensitivity, demonstrating a 13.6× speedup over linear search on the Sloan Digital Sky Survey's APOGEE data set and a 68× speedup on the GreenGenes 16S metagenomic data set, as well as asymptotically fewer distance comparisons on APOGEE when compared to the FALCONN locality-sensitive hashing library. CHESS demonstrates an asymptotic complexity not directly dependent on data set size, and is in practice at least an order of magnitude faster than linear search by performing fewer distance comparisons. Unlike locality-sensitive hashing approaches, CHESS can work with any user-defined distance function. CHESS also allows for implicit data compression, which we demonstrate on the APOGEE data set. We also discuss an extension allowing for efficient k-nearest neighbors search.
Computational biology in the 21st century
Computational biologists answer biological and biomedical questions by using computation in support of, or in place of, laboratory procedures, hoping to obtain more accurate answers at a greatly reduced cost. The past two decades have seen unprecedented technological progress with regard to generating biological data; next-generation sequencing, mass spectrometry, microarrays, cryo-electron microscopy, and other high-throughput approaches have led to an explosion of data. However, this explosion is a mixed blessing. On the one hand, the scale and scope of data should allow new insights into genetic and infectious diseases, cancer, basic biology, and even human migration patterns. On the other hand, researchers are generating datasets so massive that it has become difficult to analyze them to discover patterns that give clues to the underlying biological processes. National Institutes of Health (U.S.) (grant GM108348); Hertz Foundatio
Going the distance for protein function prediction: a new distance metric for protein interaction networks
Due to an error introduced in the production process, the x-axes in the first panels of Figure 1 and Figure 7 are not formatted correctly. The correct Figure 1 can be viewed here: http://dx.doi.org/10.1371/annotation/343bf260-f6ff-48a2-93b2-3cc79af518a9 In protein-protein interaction (PPI) networks, functional similarity is often inferred based on the function of directly interacting proteins, or more generally, some notion of interaction network proximity among proteins in a local neighborhood. Prior methods typically measure proximity as the shortest-path distance in the network, but this has only a limited ability to capture fine-grained neighborhood distinctions, because most proteins are close to each other, and there are many ties in proximity. We introduce diffusion state distance (DSD), a new metric based on a graph diffusion property, designed to capture finer-grained distinctions in proximity for transfer of functional annotation in PPI networks. We present a tool that, given a PPI network as input, outputs the DSD distances between every pair of proteins. We show that replacing the shortest-path metric by DSD improves the performance of classical function prediction methods across the board. MC, HZ, NMD and LJC were supported in part by National Institutes of Health (NIH) R01 grant GM080330. JP was supported in part by NIH grant R01 HD058880. This material is based upon work supported by the National Science Foundation under grant numbers CNS-0905565, CNS-1018266, CNS-1012910, and CNS-1117039, and supported by the Army Research Office under grant W911NF-11-1-0227 (to MEC). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
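The abstract above does not spell out how DSD is computed. A minimal sketch follows, assuming the commonly cited formulation: for each node, collect the expected number of visits to every node during a k-step random walk, then take the L1 distance between two such vectors. The function name `dsd_matrix`, the default `k=5`, the dense adjacency-list input, and counting the start as step 0 are assumptions for illustration, not details taken from this listing.

```python
def dsd_matrix(adj, k=5):
    """Pairwise diffusion-state-style distances for a small graph.

    adj is a square 0/1 adjacency matrix (list of lists); every node is
    assumed to have at least one neighbour so rows can be normalised.
    """
    n = len(adj)
    # Random-walk transition matrix: each row of adj divided by the node degree.
    P = [[adj[i][j] / sum(adj[i]) for j in range(n)] for i in range(n)]
    # walk holds P^t; it starts as the identity (t = 0, the walk's start).
    walk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # he[u][v] accumulates the expected visits to v from a walk started at u.
    he = [row[:] for row in walk]
    for _ in range(k):
        walk = [[sum(walk[i][m] * P[m][j] for m in range(n)) for j in range(n)]
                for i in range(n)]
        for i in range(n):
            for j in range(n):
                he[i][j] += walk[i][j]
    # Distance between two nodes: L1 distance between their visit vectors.
    return [[sum(abs(he[u][j] - he[v][j]) for j in range(n)) for v in range(n)]
            for u in range(n)]
```

Because the result is an L1 distance between vectors, it is symmetric and satisfies the triangle inequality by construction; the dense matrix products here are only practical for small graphs.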
CLUSTERED HIERARCHICAL ANOMALY AND OUTLIER DETECTION ALGORITHMS
Anomaly and outlier detection is a long-standing problem in machine learning. In some cases, anomaly detection is easy, such as when data are drawn from well-characterized distributions such as the Gaussian. However, when data occupy high-dimensional spaces, anomaly detection becomes more difficult. We present CLAM (Clustered Learning of Approximate Manifolds), a manifold mapping technique in any metric space. CLAM begins with a fast hierarchical clustering technique and then induces a graph from the cluster tree, based on overlapping clusters as selected using several geometric and topological features. Using these graphs, we implement CHAODA (Clustered Hierarchical Anomaly and Outlier Detection Algorithms), exploring various properties of the graphs and their constituent clusters to find outliers. CHAODA employs a form of transfer learning based on a training set of datasets, and applies this knowledge to a separate test set of datasets of different cardinalities, dimensionalities, and domains. On 24 publicly available datasets, we compare CHAODA (by measure of ROC AUC) to a variety of state-of-the-art unsupervised anomaly-detection algorithms. Six of the datasets are used for training. CHAODA outperforms other approaches on 16 of the remaining 18 datasets. CLAM and CHAODA scale to large, high-dimensional "big data" anomaly-detection problems, and generalize across datasets and distance functions. Source code for CLAM and CHAODA is freely available on GitHub.
CLAM-Accelerated K-Nearest Neighbors Entropy-Scaling Search of Large High-Dimensional Datasets via an Actualization of the Manifold Hypothesis
Many fields are experiencing a Big Data explosion, with data collection rates
outpacing the rate of computing performance improvements predicted by Moore's
Law.
Researchers are often interested in similarity search on such data.
We present CAKES (CLAM-Accelerated K-NN Entropy Scaling Search), a novel
algorithm for k-nearest-neighbor (k-NN) search which leverages geometric
and topological properties inherent in large datasets.
CAKES assumes the manifold hypothesis and performs best when data occupy a
low dimensional manifold, even if the data occupy a very high dimensional
embedding space.
We demonstrate performance improvements ranging from hundreds to tens of
thousands of times faster when compared to state-of-the-art approaches such as
FAISS and HNSW, when benchmarked on 5 standard datasets.
Unlike locality-sensitive hashing approaches, CAKES can work with any
user-defined distance function.
When data occupy a metric space, CAKES exhibits perfect recall. Comment: As submitted to IEEE Big Data 202
Autism and the broad autism phenotype: familial patterns and intergenerational transmission
Background: Features of the Broad Autism Phenotype (BAP) are disproportionately prevalent in parents of a child with autism, highlighting familial patterns indicative of heritability. It is unclear, however, whether the presence of BAP features in both parents confers an increased liability for autism. The current study explores whether the presence of BAP features in two biological parents occurs more frequently in parents of a child with autism relative to comparison parents, whether parental pairs of a child with autism more commonly consist of one or two parents with BAP features, and whether these features are associated with severity of autism behaviors in probands.
Method: Seven hundred eleven parents of a child with an autism spectrum disorder and 981 comparison parents completed the Broad Autism Phenotype Questionnaire. Parents of a child with autism also completed the Social Communication Questionnaire.
Results: Although parental pairs of a child with autism were more likely than comparison parental pairs to have both parents characterized by the presence of the BAP, they more commonly consisted of a single parent with BAP features. The presence of the BAP in parents was associated with the severity of autism behaviors in probands, with the lowest severity occurring for children of parental pairs in which neither parent exhibited a BAP feature. Severity did not differ between children of two affected parents and those of just one.
Conclusions: Collectively, these findings indicate that parental pairs of children with autism frequently consist of a single parent with BAP characteristics and suggest that future studies searching for implicated genes may benefit from a narrower focus that identifies the transmitting parent. The evidence of intergenerational transmission reported here also provides further confirmation of the high heritability of autism that is unaccounted for by the contribution of de novo mutations currently emphasized in the field of autism genetics.
MEDFORD: A HUMAN AND MACHINE READABLE METADATA MARKUP LANGUAGE
Reproducibility of research is essential for science. However, in the way modern computational biology research is done, it is easy to lose track of small, but extremely critical, details. Key details, such as the specific version of the software used or the iteration of a genome, can easily be lost in the shuffle, or perhaps not noted at all. Much work is being done on the database and storage side of things, ensuring that there exists a space to store experiment-specific details, but current mechanisms for recording details are cumbersome for scientists to use. We propose a new metadata description language, named MEDFORD, in which scientists can record all details relevant to their research. Human-readable, easily editable, and templatable, MEDFORD serves as a collection point for all notes that a researcher could find relevant to their research, be it for internal use or for future replication. MEDFORD has been applied to coral research, documenting research from RNA-seq analyses to photo collections.
Hidden Subluminous sd/wd among the FAUST UV sources toward OPHIUCHUS
A UV image in the direction of Ophiuchus, obtained with the FAUST instrument,
is analysed. Suitable candidates for unrecognized subluminous stars are selected
by comparing the observed UV flux to the predicted one. The UV-excess objects were
observed at the 1.0 m Wise telescope. This method yields the detection of
eight objects with broad Balmer lines. Six are classified as sds and two as wds
by comparing the Hbeta line profile with that of stellar model atmospheres. Comment: 2 pages, including 2 figures. To appear in the Proceedings of the
13th European Workshop on White Dwarfs. NATO Science Series II, Kluwer
Academic Publishe